Internal validation of Automated Visual Evaluation (AVE) on smartphone images for cervical cancer screening in a prospective study in Zambia

Abstract Objectives Visual inspection with acetic acid (VIA) is a low‐cost approach for cervical cancer screening used in most low‐ and middle‐income countries (LMICs) but, similar to other visual tests, is subjective and requires sustained training and quality assurance. We developed, trained, and validated an artificial‐intelligence‐based “Automated Visual Evaluation” (AVE) tool that can be adapted to run on smartphones to assess smartphone‐captured images of the cervix and identify precancerous lesions, helping augment VIA performance. Design Prospective study. Setting Eight public health facilities in Zambia. Participants A total of 8204 women aged 25–55. Interventions Cervical images captured on commonly used low‐cost smartphone models were matched with key clinical information including human immunodeficiency virus (HIV) and human papillomavirus (HPV) status, plus histopathology analysis (where applicable), to develop and train an AVE algorithm and evaluate its performance for use as a primary screen and triage test for women who are HPV positive. Main Outcome Measures Area under the receiver operating curve (AUC); sensitivity; specificity. Results As a general population screening tool for cervical precancerous lesions, AVE identified cases of cervical precancerous and cancerous (CIN2+) lesions with high performance (AUC = 0.91, 95% confidence interval [CI] = 0.89–0.93), which translates to a sensitivity of 85% (95% CI = 81%–90%) and specificity of 86% (95% CI = 84%–88%) based on maximizing the Youden's index. This represents a considerable improvement over naked eye VIA, which as per a meta‐analysis by the World Health Organization (WHO) has a sensitivity of 66% and specificity of 87%. For women living with HIV, the AUC of AVE was 0.91 (95% CI = 0.88–0.93), and among those testing positive for high‐risk HPV types, the AUC was 0.87 (95% CI = 0.83–0.91). Conclusions These results demonstrate the feasibility of utilizing AVE on images captured using a commonly available smartphone by nurses in a screening program, and support our ongoing efforts for moving to more broadly evaluate AVE for its clinical sensitivity, specificity, feasibility, and acceptability across a wider range of settings. Limitations of this study include potential inflation of performance estimates due to verification bias (as biopsies were only obtained from participants with visible aceto‐white cervical lesions) and due to this being an internal validation (the test data, while independent from that used to develop the algorithm was drawn from the same study).


1,2
When screenings are conducted, Visual Inspection with Acetic Acid (VIA) is the most commonly used screening tool for the detection of precancerous lesions; however, it is a relatively inaccurate method with significant heterogeneity of performance across providers, with sensitivity ranging from 22% to 91% and specificity ranging from 47% to 99% in published studies.
3 Widespread use of VIA is largely due to its low cost and the perceived simplicity with which it can be administered, even in rural areas.It also facilitates same-visit treatment of precancerous lesions.For these reasons, it has been integrated into many national cervical cancer screening efforts.
The 2021 World Health Organization (WHO) cervical cancer screening and treatment guideline recommends the use of human papillomavirus (HPV) DNA testing as a primary screening method where feasible.With a negative predictive value close to 100%, 4 a negative HPV DNA test result offers considerable assurance that a woman will not soon develop precancerous lesions, lengthening the intervals for when women need to return for re-screening.Women can also collect their own samples through a vaginal swab, avoiding the need for a pelvic exam with a speculum.However, HPV tests are not currently affordable for widespread use in most low-income countries, and practical point-of-care HPV testing solutions are not yet available, posing major barriers to scale-up and sustainability.
VIA can be used by providers to determine whether women who test HPV positive from both the general population and those living with HIV should be treated and/or what type of treatment is appropriate.When using the "screen and treat" approach, WHO recommends treating all women who test HPV positive.In such instances VIA can be used to determine the most appropriate treatment, depending on the characteristics of the cervical transformation zone and cervical precancerous lesion, if the latter is visible.When using the "screen, triage, and treat" approach, WHO recommends triaging all women who test HPV positive with a visual test, such as VIA, to determine if treatment is mandated and what type is appropriate.
For these reasons, visual evaluation will remain a fundamental component of national screening and treatment programs in LMICs for the foreseeable future, especially as local governments and their global public health partners work to scale up secondary prevention services to accelerate the elimination of .cervical cancer as a public health problem.Still, for visual evaluation to achieve its intended impact at sustained scale, innovation is needed to improve accuracy and reliability.
A proposed screening method termed Automated Visual Evaluation (AVE), which uses a machine learning algorithm to analyze an image of the cervix and provides a result indicating whether precancerous lesions are likely present or absent, has the potential to fill this gap.To use AVE, a provider takes a picture of the cervix using a standard smartphone, runs the algorithm on the mobile phone processor, and then receives a 'positive' or 'negative' result in less than one minute.AVE is intended as a decision-making aid to assist providers conducting VIA.
Results published in 2019 of an AVE algorithm trained and tested on data from a longitudinal study in Costa Rica demonstrated the proof of principle for AVE 5 with an area under the curve (AUC) of 0.91, which translated into an estimated sensitivity of 97.7% and specificity of 84% in identifying precancerous lesions and cancer (cervical intraepithelial neoplasia grade 2 or greater; "CIN2+") among women 25-49 years old.The accuracy of the AVE algorithm proved superior to clinical opinion or conventional cytology; however, generalizability of the Guanacaste study is limited by the relative homogeneity of the dataset.All women tested were from Costa Rica, which has a significantly lower HIV prevalence than other LMICs, particularly in Africa.Further, the Guanacaste analysis was performed using high-quality images of standard angle, light, and focus, captured by a small team of highly trained nurses using a cerviscope-a fixed-focus, ring-lit film camera that has since been discontinued.To be used at scale, an AVE algorithm would need to work on a more manageable and affordable device, and on images captured by health workers with a range of experience and training levels.This article describes the results of our effort to adapt AVE for use on a commercially available and affordable smartphone, with the machine learning training, validation, and testing of AVE using smartphone images collected by providers from women in Zambia.An approach combining AVE and HPV genotyping, based on a subset of the data from this same study, was previously published 6 .

Clinical Methods
Our study was nested within Zambia's public sector cervical cancer prevention service platform.More than 8,000 women aged 25-55 were prospectively recruited from among the population of women attending one of eight Ministry of Health public health facilities in Lusaka for routine care from October 2019 through December 2021.Zambia was chosen as the site for this study based on the presence of sophisticated clinical and patient management infrastructure-the product of a fifteen-year investment by the Zambia Ministry of Health (MOH), local partners, and external donors, including the US President's Emergency Plan for AIDS Relief (PEPFAR).
In 2005, the VIA screening method was launched in three government-operated primary healthcare clinics co-housed with HIV/AIDS programs funded by PEPFAR, in the capital city of Lusaka.The study was designed to collect at least 750 histopathology-confirmed patient images of precancer, as this was judged to be a sufficient target for machine learning training and validation experiments.
Because precancerous lesions are relatively rare, nesting the study in nurse-led clinics in which same-day "screen-treat" services included electrical excision procedures (LEEP/LLETZ) allowed us to meet the target number of CIN2+ images.In the study, more than 8,000 women were prospectively recruited from among a population of women receiving routine care from one of eight public health facilities, seven in Lusaka and one in Kitwe, during the period of October 2019 through December 2021.The women were eligible to participate if they were aged 25-55 years (age range recommended for cervical cancer screening by the MOH) and able to understand and willing to sign a written informed consent form.Women were excluded if they were pregnant or had a history of hysterectomy, trachelectomy, or treatment of cervical precancer or cancer.Women were enrolled consecutively at each site.
After eligibility and consent were confirmed, study participants were offered HIV and HPV testing.
Following HPV sample collection using the COPAN FLOQSwab, cervical cancer screening was performed using 'digital cervicography' (DC) as an adjunct to VIA in accordance with guidelines established for the government-operated 'screen and treat' clinics.For DC-VIA, the cervix was lavageed with a cotton swab soaked in 5% acetic acid by nurse-providers.After a waiting period of 2-3 minutes, an image of the cervix was obtained using a digital camera, then magnified to (1) identify the transformation zone (TZ) type, (2) determine the presence and characteristics of aceto-white lesions, and (3) decide the appropriate form of management.The digital camera cervical image was then shown to and discussed with the participant along with the recommended management plan.
For this study, with a goal to train the AVE algorithm only and not for informing clinical care decisions, additional images of the cervix (typically three images per participant) were obtained with a smartphone in quick succession after the digital camera image was captured.Two smartphone models, the Samsung Galaxy J8 and Samsung Galaxy A21s, readily accessible and affordable in Zambia at the time of data collection, were used to collect these images.A commercially available smartphone app-Box.com-was used to capture and store the images on the smartphones.Typically, only one model of smartphone was used for each participant, but some participants had images captured by providers using both models during the transition period while switching phone models.When lesions were identified, participants were treated in line with the Zambia MOH cervical cancer screening guidelines.Specifically: 1.
Participants whose transformation zones (TZ) and any aceto-white lesions, if present, were purely ectocervical (TZ type l), and occupied less than three-fourth of the transformation zone and were without features of invasive cancer, were offered ablative therapy using thermal ablation, performed using the Liger handheld cordless coagulator.For participants whose lesions were suspicious for cancer, a punch biopsy was performed.
As an additional study procedure, a punch biopsy was performed prior to ablation on all women with aceto-white lesions and eligible for ablative therapy.
For our study reference, we leveraged cervical histopathology to confirm the presence of cervical precancerous lesions.Biopsy samples were taken from all women who screened positive on digital cervicography (DC) or VIA.Histopathologic analysis was performed on all tissue (punch biopsy and LEEP/LLETZ)) specimens, with results reported as normal, benign lesion, CIN1, CIN2, CIN3, micro-invasive cancer, or invasive cancer.
Study participants found to have invasive cancer on histopathology were referred to the Gynecologic Oncology Unit of University Teaching Hospitals (UTHs) -Women and Newborn Hospital in Lusaka for appropriate staging and treatment.For all study participants, screening (and treatment, if applicable) was followed up with routine screening services in the same governmentoperated clinics, per the routine standard of care.
HPV testing was performed with the Beckton Dickinson (BD) Viper LT system, using the BD Onclarity™ HPV Assay as per manufacturer's specifications and standardized performance instructions.The BD

Machine Learning Methods
Machine learning leverages the abilities of computers to analyze large datasets and draw conclusions from patterns.To train an AVE algorithm to recognize cervical precancer cases accurately, thousands of images representing control status or histopathology-confirmed case status were presented to the algorithm.To diagnose cervical precancer and cancer, histopathology is considered the gold standard test.The inputs to train our AVE model included the set of cervical images and the ground truth label for each image.The ground truth label was a binary input -case vs. control -with case representing CIN2, CIN3, or cancer ('CIN2+') and control representing images that had a histopathology result less severe than CIN2 ('< CIN2') or images that were negative on DC-VIA (and therefore represent participants who did not receive a biopsy).
Smartphone images were labeled with the study ID (and no other identifiers) and uploaded directly from the smartphones to our study's machine learning team via the Box.comApp.Linked, de-identified participant data with key demographic and clinical details (age, HIV status, DC-VIA result, HPV test result, histopathology result if applicable) were shared electronically with the machine learning team via Excel files.
Our study objective was to train, validate, and test an algorithm with high discrimination between cases and controls.This paper details the AVE algorithm developed and tested using images from the Samsung A21s smartphone model.As detailed in Figure 1, participants were excluded if they (1) did not have a good quality A21s smartphone image or (2) had screened positive but did not have a histology result.
This resulted in 13484 A21s images, representing 5188 participants.These A21s images comprise the data set used for the training and evaluation of AVE algorithm reported here.The Samsung J8 images were used in preliminary development as described later in this section and in the supplementary. .

13
. A key difference between the two studies is that the work presented here is based on smartphone images whereas the Hu et al work was based on digitized cervigram images.Other differences include our study rescaling the images to a width of 900 pixels and height of 1200 pixels (with most images maintaining the original aspect ratio) and selecting a learning rate of 2e-5. Another

Analysis
We analyzed results on Samsung A21s images to support subsequent AVE validation studies on that device.
The primary measure of diagnostic accuracy is the Receiver Operating Characteristic (ROC) curve and its summary statistic, area under the curve (AUC).
AVE's sensitivity is defined as: # of histopathology CIN2+ participants identified as AVE positive # of histopathology CIN2+ participants AVE's specificity is defined as: # of reference negative* participants identified as AVE negative # of reference negative* participants *Reference negative is defined as EITHER histopathology result of no precancer/cancer (i.e., <CIN2) OR DC-VIA-negative screening result.This consists of the entirety of categories 1 and 2 from Table 1.
Study participants with a DC-VIA-positive result, but lacking a histopathology result, were not included in the AVE test set and performance analysis.Participants without an HPV result or VIA result were included in the AVE test set and performance analysis when a histopathology result was available.
AVE is intended as a primary screening tool or as a triage test for women who are HPV-positive.Our training and testing data from Zambia were enriched with data from patients that underwent LEEP/LLETZ based on an initial screen.This was designed to help our study team collect a sufficient absolute number of images of CIN2+ to train the AVE algorithm.However, it also resulted in an image set with higher severity relative to a typical screening setting.In response, we "rebalanced" our test set to enable an accurate measure of AVE's performance in a primary screening or HPV+ triage setting that would be most typical for the geographies where AVE is intended to be used.See Supplement S3 for a description of this rebalancing process.
We also analyzed the results of AVE to validate different use cases; specifically, the use of AVE as a triage test for women who are HPV positive and the use of AVE among WLHIV.Tables 2 and 3 show the number of participants and images in the test sets for each of these use cases.

Public and patient involvement
Patients (i.e., participants of the study) were not involved in the design of this phase of the research study.However, we anticipate client involvement as we conduct implementation research because provider, client, and community partner acceptance will be key to the scale up and sustained use of AVE in LMICs.

Clinical results
Of the 8,204 women aged 25-55 who were screened with HPV testing and VIA with digital cervicography (DC-VIA), 4,390 (54%) were HIV positive; 3,388 (44%) were positive for HPV; and 3,149 (38%) were positive on DC-VIA.Of the participants who were DC-VIA-positive, 1,024 (33%) women were treated with thermal ablation and 1,807 (57%) were treated with excision.(A further 318 women (10%) were suspicious for cancer at the time of screening and referred to Gynecologic Oncology Unit at UTH-Women and Newborn Hospital for further evaluation and cancer treatment, as required.)Of the 3,121 women who underwent histopathology analysis, 817 (26%) were negative; 566 (18%) had CIN1; 1,337 (43%) had CIN2/3; 345 (11%) were found to have invasive cervical cancer; and 56 (2%) had unsatisfactory or invalid histopathology results. .

Machine learning results
We evaluated the accuracy of the AVE algorithms trained and tested using Samsung J8 images with various categorization splits (as described in the methods section and Supplement S1), for the detection of CIN2+ using Area Under the Receiver Operating Characteristic Curve (AUC).The AUCs of the relevant J8 algorithms are shown in Supplement S1 and based on this, the final model utilizing A21s images was trained on a split with A21s case images consisting of all CIN2+ categories, but A21s control images were limited to participants testing negative on DC-VIA and HPV test.
At the image level, the AUC of the model trained on A21s images when tested on the A21s test set was 0.91 (95% CI = 0.90 to 0.92).To get participant-level predictions, we chose the image with a predicted confidence score closest to zero (if the image was detected as normal cervix) or one (if the image was detected as precancerous cervix) for each participant.In other words, we chose the image with the "maximum confidence".We applied this method to the test set predictions, achieving a participant-leve AUC of 0.91 (95% CI = 0.89 to 0.92).This method represents a scenario in which up to three images are captured from each subjects and used in evaluation.We also explored the consistency of the output for each image.For participants in the test set with at least two suitable images, there was agreement of the precancer classification for all images for 88% of participants.
We then adapted the test set results to the general population setting, described in more detail in supplement S3.This resulted in a participant-level AUC of 0.91 (95% CI = 0.89 to 0.93), with the threshold based on maximizing Youden's index resulting in sensitivity of 85% (95% CI = 81% to 90%) and specificity of 86% (95% CI = 84% to 88%).We also evaluated performance among women testing positive for high-risk HPV, achieving an AUC of 0.87 (95% CI = 0.83 to 0.91), and among WLHIV, achieving an AUC of 0.91 (95% CI = 0.88 to 0.93).The ROC curves for general population screening, HPV triage, and among WLHIV settings are shown in Figure 3. Adding training data for the device under test ("primary device") improves the AUC, as expected, and less data from the primary device is needed if accompanied by data from a secondary device than if limited to only data from the primary device.For example, for the A21s test set, reaching an AUC of 0.87 when training with only A21s data was not achieved until training with 212 cases and 578 controls (at the patient level).However, when combined with a large J8 training data set (432 cases, 609 controls), the same AUC is achieved by adding only 53 A21s cases and 144 A21s controls.

V. Discussion
In this work, we developed AVE, an AI-based algorithm that can be adapted for running on a smartphone that can detect CIN2+ precancerous lesions on images of the cervix captured with low cost, commonly available smartphones.The study demonstrated that the performance of AVE for primary cervical cancer screening was high as a primary screening method for all women (AUC = 0.91, 95% CI = 0.89 to 0.93) and WLHIV (AUC = 0.91, 95% CI = 0.88 to 0.93) and was also high when used as a triage when women screening positive for high-risk HPV (AUC = 0.87, 95% CI = 0.83 to 0.91).These results underscore AVE's potential as a highly accurate, accessible tool for screening precancerous cervical lesions and the technology's broader potential to help accelerate the Global Strategy to Eliminate Cervical Cancer as a Public Health Problem. 14AVE-assisted VIA holds promise as an effective primary screening tool within geographies where HPV testing is not yet available to all women in need of screening.With a sensitivity of 85% (CI=81% to 90%) and specificity of 86% (CI = 84% to 88%) at a threshold that maximizes the Youden's index, the AVE algorithm represents a considerable improvement in sensitivity over VIA alone that has been reported to have a sensitivity of 66% (95% CI: 61%-71%) in WHO's meta-analysis, with "extreme heterogeneity" in reported results across studies (22% to 91%). 3 If the use is further clinically validated, the use of AVEassisted VIA as a primary screening test may enable women to be screened and treated for precancerous lesions by a more reliable and objective test in a single visit, a critical element of womancentered care that increases the likelihood that women in need of treatment can access it.AVE-assisted VIA can also be used as a triage test for women who are HPV-positive in countries that use a 'screentriage-treat' approach with HPV screening as the primary test.
This study, focused on data collected from experienced clinicians in Zambia, was an important first step  While we do not have data to suggest that cervical appearance varies based on women's ethnicity, race, or geography of origin, previous research has shown that the prevalence of underlying conditions with the potential to affect cervical appearancesuch as female genital schistosomiasis and chronic cervicitis -can vary by setting.Clinic environments, such as variations in lighting, exam room set-up, quality of acetic acid, and provider skill, may also vary by setting and could potentially affect algorithm performance.

8
With strong high-level advocacy from the Office of the First Lady of Zambia, leadership from the Ministry of Health, a robust community education program, and an innovative collaboration of traditional leaders (e.g., local tribal chiefs, traditional healers, traditional marriage counselors), a matrix of nurse-led sameday 'screen-treat' services (VIA, thermal ablation, loop electrosurgical excision procedure (LEEP)/large loop excision of the transformation zone (LLETZ), punch biopsy) was successfully scaled to 160 public health clinics across all of the country's 10 provinces.Altogether, these investments in human resource development, alongside electronic data collection and public sector pathology laboratory infrastructure, have enabled Zambia's cervical cancer prevention platform to now support extramurally funded clinical research, including randomized clinical trials and our scientific investigations involving AVE algorithms.

9 2 .
Participants whose transformation zones extended into the endocervical canal (TZ typell/lll) or whose lesions occupied more than three-fourth of the transformation zone, were offered LEEP/LLETZ.
to develop the algorithm from a clinical environment representing the proposed use case.It is important to highlight that the results presented are from internal validation, i.e. the reported results are based on testing the algorithm on a population drawn from the same study as the training data.External validation of the AVE algorithm, i.e. testing on data from a different study populations and settings, is needed to fully understand the generalization performance of the algorithm.Prior work has shown that performance typically drops in external validation.

15
HPV Assay is an automated laboratory test that detects DNA (deoxyribonucleic acid) from 14 high risk HPV types that are associated with cervical cancer.The test specifically identifies HPV types 16, 18, 31,45, 51, 52 while concurrently detecting types 33/58, 35/39/68, and 56/59/66.For the purpose of this analysis, the results were classified as being high-risk positive or negative.The assay was performed at the Women's and Newborn Hospital/University of North Carolina HPV Laboratory located on the campus of UTH.The study protocol was reviewed and approved by the University of Zambia Biomedical Research Ethics Committee, Zambian National Health Research Agency, and the University of North Carolina Institutional Review Board, in addition to scientific and programmatic reviews by various committees at funding agencies.All study participants provided written informed consent.Women who declined or were unable to provide written informed consent were offered screening and treatment for precancerous lesions in line with the standard of care in Zambia.This study was designed to collect data for training an AVE algorithm as well as evaluating the performance of the algorithm; as diagnostic accuracy is evaluated, this study is being reported in accordance with the Standards for Reporting Diagnostic accuracy studies (STARD) checklist (STARD supplement checklist) 11 .

Table 1 .
significant difference involved our experimenting with different subsets of cases and controls used for training the AVE algorithm.Each image is linked to the corresponding HPV test result, VIA result, and (where applicable) histology result.Based on these test results, participants can be divided into several distinct categories.For our study, participants were divided into the categories shown in For example, a study participant with an HPV negative and VIA negative result was labeled as "Category 1a" and a study participant with histology CIN3 was labeled as "Category 3b".While the overall target of our AVE model is to separate all CIN2+ subjects from all <CIN2 subjects, this finer categorization allows training models using specific categories, as described below.It also allowed us to estimate performance in different settings such as general population screening or HPV triage.It is possible that this finer categorization might enable training multiclass (rather than binary) models, and even finer subcategorization, for example through stratification of study participants by HPV genotype, is possible, but our current study did not explore these.
During initial model development, we trained binary models on the Samsung J8 images splitting the controls and cases in several different ways, as described in Supplement S1.Each model used a different definition of case versus control during the training and validation process.Importantly, while not all categories of images were used in the training and validation sets, all categories were used in the test set to assess each model's ability to distinguish CIN2+ from <CIN2 across a full range of potential disease states.The categorization split producing the model that performed best (defined as the highest AUC) on a Samsung J8 test set was that in which training controls were limited to HPV-and DC-VIA-images (category 1a) and the training cases included all CIN2+ images (categories 3 and 4).This same split across categories was then used to train models with Samsung A21s images.

Table 1 .
Participant categorizations.To explore device portability of the AVE models (how well a model trained on data from one device performs on data from another device), we evaluated the performance of models trained with J8 data only, A21s data only, and combinations of varying amounts of J8 and A21s data on the test sets from each smartphone.This work is described in detail in Supplement S2.

Table 2 .
Participants and image count of the AVE test set by HPV status.

Table 3 .
Participants and image count of the AVE test set by HIV Status.